We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. We have two measures of growth rate: basal area and biomass. A gradient boosted model seems like a good candidate for this work.
A gradient boosted machine (GBM) is a machine learning model that uses decision trees to fit the data.
A decision tree starts with all of the observations and then, from the variables provided, tries to find the variable split that yields the "purest" groupings of the data. In this case, it would try to place rows with higher growth rates in one node and rows with lower growth rates in another node.
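That split search can be sketched in a few lines: for a regression response, a common purity measure is the total within-node sum of squares, and the best threshold is the one that minimizes it. The data below are invented for illustration; `best_split` is a toy helper, not part of the analysis.

```r
# Toy sketch of how a regression tree scores one candidate split:
# try every threshold of a numeric predictor and keep the one whose
# two child nodes have the lowest total sum of squares ("purest").
best_split <- function(x, y) {
  thresholds <- head(sort(unique(x)), -1) # the largest value cannot split
  sse <- function(v) sum((v - mean(v))^2)
  totals <- sapply(thresholds, function(thr) {
    sse(y[x <= thr]) + sse(y[x > thr])    # lower total = purer children
  })
  thresholds[which.min(totals)]
}

# invented data: growth rate jumps once "light" passes 0.5
set.seed(1)
light  <- runif(100)
growth <- ifelse(light > 0.5, 2, 1) + rnorm(100, sd = 0.1)
best_split(light, growth)  # recovers a threshold close to 0.5
```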
GBMs are an ensemble of decision trees, but the trees are fit
sequentially. We call GBMs an ensemble of weak learners because each
subsequent tree attempts to correct the errors of the previous
trees. Thus, while one tree, by itself, cannot describe the
relationships, the full set of trees can. Below is a figure
by Bradley Boehmke that illustrates how each subsequent tree
improves the fit to the data.
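The sequential idea can also be shown by hand: start from a constant prediction, repeatedly fit a depth-1 tree (a "stump") to the current residuals, and add a shrunken copy of it to the ensemble. This is an illustration of the mechanism, not the `gbm` package's implementation; all data here are simulated.

```r
# Hand-rolled boosting with rpart stumps: each weak learner is fit to
# the residuals left by the ensemble so far, then added with shrinkage.
library(rpart)

set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

learn_rate <- 0.1
pred <- rep(mean(y), length(y))          # round 0: a constant prediction
for (i in 1:100) {
  resid <- y - pred                      # errors of the current ensemble
  stump <- rpart(resid ~ x, data = data.frame(x, resid),
                 control = rpart.control(maxdepth = 1))
  pred  <- pred + learn_rate * predict(stump)  # shrunken correction
}
# any single stump is a weak learner; the accumulated sum tracks sin(x)
mean((y - pred)^2)
```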
Below is a list of the changes made to the raw data: Soil.Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
First, we can import the data and select the labels needed for plotting later.
# a prior error suggests that single quotation marks are used to group digits
# I specify that we should read those as a thousands separator
# we will no longer use the data with the influence of sampling date removed
rgr_msh_julian <- read_csv(here("data/RGR_MSH_PCA_NA.csv"),
guess_max = 10000,
col_types = cols())
# these are the labels for the variables - needed for plotting
labels_rgr_msh <- read_csv(here("data/labels.csv"),
guess_max = 10000,
col_types = cols())
# this helps in listing the variables we actually need to keep, depending on the analyses
PlantTraitsKeep <- labels_rgr_msh %>% filter(Class == 1) %>% select(Feature) %>% reduce(c)
EnvironmentalVariablesKeep <- c("Soil.Fertility", "Light", "Temperature",
"pH", "Soil.Humidity.Depth", "Slope")
OthersKeep <- labels_rgr_msh %>% filter(Class %in% c(3, 4, 5, 7)) %>% select(Feature) %>% reduce(c)
OthersKeep <- c("SampleID", "Site", "Species", "julian.date.2011",
OthersKeep[!OthersKeep %in% c("SampleID", "Site", "Species", "julian.date.2011")])

Now, we can create the training and test sets; we use an 80:20 split here.
# need a training and test set to assess the performance of the model
# an 80:20 split is used here
# first, I will work with the imputed data
set.seed(634)
split_train_test <-
initial_split(
data = rgr_msh_julian,
prop = 0.80)
rgr_msh_na <-
split_train_test %>% training() %>% mutate(Group = "Train")%>%
bind_rows(split_train_test %>% testing() %>% mutate(Group = "Test"))

# training the model using H2O - made a function that allows us to do this
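Under the hood, `initial_split()` is doing little more than the following base-R sketch: sample a fixed fraction of row indices for training and label the rest as the test set. The row count here is invented for illustration.

```r
# Base-R sketch of an 80:20 train/test split, mirroring what
# rsample::initial_split(prop = 0.80) does with row indices.
set.seed(634)
n <- 1000
idx_train <- sample(n, size = floor(0.8 * n))   # 800 random training rows
group <- ifelse(seq_len(n) %in% idx_train, "Train", "Test")
table(group)  # Test: 200, Train: 800
```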
# takes ~ 3 minutes
startTime <- Sys.time()
gbmParmBIONoDate <-
h20_gbm_tune(path = here("data/RGRH20BIO.csv"),
status = "BIO_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_11"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 12
##
## $min_rows
## [1] 2
##
## $nbins
## [1] 32
##
## $nbins_cats
## [1] 32
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3590.617
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.9
##
## $col_sample_rate
## [1] 0.93
##
## $col_sample_rate_per_tree
## [1] 0.2
##
## $min_split_improvement
## [1] 1e-08
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age"
##
## $y
## [1] "BIO_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bio_residuals <-
gbm(BIO_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BIO_GR"))),
n.trees = 1000,
interaction.depth = 12, # max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 22, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.9, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
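This "hold everything else constant" view is the idea behind a partial dependence plot: sweep one predictor over a grid while keeping every other column at its observed values, and average the model's predictions at each grid point. The sketch below uses a stand-in linear model and invented variable names; the same helper works for a fitted `gbm` object.

```r
# Partial dependence by hand: overwrite the focal predictor with each
# grid value, predict for all rows, and average the predictions.
set.seed(7)
df <- data.frame(light = runif(300), ph = runif(300, 4, 7))
df$growth <- 2 * df$light + 0.1 * df$ph + rnorm(300, sd = 0.1)
fit <- lm(growth ~ light + ph, data = df)

partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v                       # vary the focal predictor only
    mean(predict(model, newdata = data))   # average over the other columns
  })
}

grid <- seq(0, 1, length.out = 5)
pd <- partial_dependence(fit, df, "light", grid)
diff(pd) / diff(grid)  # recovers the simulated slope of ~2 per unit light
```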
How does the model perform when we use the true individual trait value?
Does the test error change when we use the average for that species?
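The species-average comparison amounts to overwriting each tree's measured trait with the mean of its species and re-scoring the predictions. The sketch below uses simulated data, made-up species codes, and a stand-in linear model; with the real analysis, the fitted GBM and test rows would take their places.

```r
# Compare prediction error using individual trait values versus
# species-mean trait values (all data invented for illustration).
set.seed(11)
toy <- data.frame(
  Species = rep(c("spA", "spB"), each = 50),
  LMA     = c(rnorm(50, 100, 10), rnorm(50, 150, 10))
)
toy$BIO_GR <- 0.01 * toy$LMA + rnorm(100, sd = 0.05)
fit <- lm(BIO_GR ~ LMA, data = toy)

toy_mean <- toy
toy_mean$LMA <- ave(toy$LMA, toy$Species)   # replace with species average

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(individual   = rmse(toy$BIO_GR, predict(fit, toy)),
  species_mean = rmse(toy$BIO_GR, predict(fit, toy_mean)))
```

Here the species-mean predictions lose the within-species trait variation, so their error is larger; whether that gap matters for the real model is exactly the question above.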
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 26.678, df = 3, p-value = 6.879e-06
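A test of this shape can be reproduced with base R's `kruskal.test`, given a long data frame of interaction strengths (`Value`) labelled by the kind of variable pair involved (`Class`). The data and class labels below are invented stand-ins.

```r
# Kruskal-Wallis test on interaction strengths grouped by variable-pair
# class (toy data; three classes with clearly different ranges).
set.seed(3)
inter <- data.frame(
  Class = rep(c("EnvEnv", "TraitEnv", "TraitTrait"), each = 20),
  Value = c(runif(20, 0.2, 0.5), runif(20, 0.1, 0.3), runif(20, 0, 0.15))
)
kruskal.test(Value ~ Class, data = inter)  # df = number of classes - 1
```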
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
# training the model using H2O
startTime <- Sys.time()
gbmParmBAINoDate <-
h20_gbm_tune(path = here("data/RGRH20BAI.csv"),
status = "BAI_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_96"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 6
##
## $min_rows
## [1] 64
##
## $nbins
## [1] 256
##
## $nbins_cats
## [1] 512
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3480.334
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.33
##
## $col_sample_rate
## [1] 0.42
##
## $col_sample_rate_per_tree
## [1] 0.2
##
## $min_split_improvement
## [1] 1e-08
##
## $histogram_type
## [1] "UniformAdaptive"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals <-
gbm(BAI_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR"))),
n.trees = 1000,
interaction.depth = 6, #max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 10, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.33, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform when we use the true individual trait value?
Does the test error change when we use the average for that species?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 19.714, df = 2, p-value = 5.239e-05
## Comparison Z P.unadj
## 1 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -2.744831 6.054205e-03
## 2 Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.630593 3.646202e-06
## 3 Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -2.108346 3.500109e-02
## P.adj
## 1 1.210841e-02
## 2 1.093861e-05
## 3 3.500109e-02
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
startTime <- Sys.time()
gbmParmBIONoDateSpecies <-
h20_gbm_tune(path = here("data/RGRH20BIOSpecies.csv"),
status = "BIO_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_97"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 15
##
## $min_rows
## [1] 2
##
## $nbins
## [1] 1024
##
## $nbins_cats
## [1] 4096
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3563.938
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.81
##
## $col_sample_rate
## [1] 0.91
##
## $col_sample_rate_per_tree
## [1] 0.22
##
## $min_split_improvement
## [1] 1e-06
##
## $histogram_type
## [1] "QuantilesGlobal"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope" "Species"
## [7] "Tree.Age"
##
## $y
## [1] "BIO_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bio_residuals_species <-
gbm(BIO_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BIO_GR"))) %>%
mutate(Species = factor(Species)),
n.trees = 1000,
interaction.depth = 15, # max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 7, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.81, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform when we use the true individual trait value?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 0.52641, df = 2, p-value = 0.7686
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
startTime <- Sys.time()
gbmParmBAINoDateSpecies <-
h20_gbm_tune(path = here("data/RGRH20BAISpecies.csv"),
status = "BAI_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_56"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 3
##
## $min_rows
## [1] 1
##
## $nbins
## [1] 512
##
## $nbins_cats
## [1] 512
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3585.667
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.58
##
## $col_sample_rate
## [1] 0.74
##
## $col_sample_rate_per_tree
## [1] 0.76
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "UniformAdaptive"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope" "Species"
## [7] "Tree.Age"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals_species <-
gbm(BAI_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR"))) %>%
mutate(Species = factor(Species)),
n.trees = 1000,
interaction.depth = 3, # max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 5, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.58, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 0.38788, df = 1, p-value = 0.5334
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
# training the model using H2O - made a function that allows us to do this
# takes ~ 5 minutes
startTime <- Sys.time()
gbmParmBIONoDateSpeciesAgeEP <-
h20_gbm_tune(path = here("data/RGRH20BIOSpeciesAgeEP.csv"),
status = "BIO_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_95"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 11
##
## $min_rows
## [1] 2
##
## $nbins
## [1] 512
##
## $nbins_cats
## [1] 4096
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3544.164
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.36
##
## $col_sample_rate
## [1] 0.42
##
## $col_sample_rate_per_tree
## [1] 0.67
##
## $min_split_improvement
## [1] 1e-04
##
## $histogram_type
## [1] "QuantilesGlobal"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Species" "Tree.Age"
##
## $y
## [1] "BIO_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bioSpeciesAgeEP <-
gbm(BIO_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Species",
"Tree.Age", "BIO_GR"))) %>%
mutate(Species = factor(Species)),
n.trees = 1000,
interaction.depth = 11, # max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 10, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.36, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 13.939, df = 3, p-value = 0.002989
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
# training the model using H2O
startTime <- Sys.time()
gbmParmBAINoDateSpeciesAgeEP <-
h20_gbm_tune(path = here("data/RGRH20BAISpeciesAgeEP.csv"),
status = "BAI_GR",
type = "integer")
endTime <- Sys.time()
endTime - startTime

First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_92"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 12
##
## $min_rows
## [1] 4
##
## $nbins
## [1] 128
##
## $nbins_cats
## [1] 2048
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3526.383
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.54
##
## $col_sample_rate
## [1] 0.39
##
## $col_sample_rate_per_tree
## [1] 0.76
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "UniformAdaptive"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Species" "Tree.Age"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_baiSpeciesAgeEP <-
gbm(BAI_GR ~ .,
data =
rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep,"Species" ,
"Tree.Age", "BAI_GR")))%>%
mutate(Species = factor(Species)),
n.trees = 1000,
interaction.depth = 12, #max depth
shrinkage = 0.05, #learning rate
n.minobsinnode = 10, # minimum observations per node (H2O's min_rows)
bag.fraction = 0.54, # sample_rate
verbose = FALSE,
n.cores = NULL,
cv.folds = 5)

First, we look at the importance of variables in the model.
Holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 14.487, df = 3, p-value = 0.002312
Now, we plot interactions with values > 0.1.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.